Cali-Sketch: Stroke Calibration and Completion for High-Quality Face Image Generation from Poorly-Drawn Sketches
The image generation task has received increasing attention because of its wide
application in security and entertainment. Sketch-based face generation brings
more fun and better quality of image generation due to supervised interaction.
However, when a sketch poorly aligned with the true face is given as input,
existing supervised image-to-image translation methods often cannot generate
acceptable photo-realistic face images. To address this problem, in this paper
we propose Cali-Sketch, a poorly-drawn-sketch to photo-realistic-image
generation method. Cali-Sketch explicitly models stroke calibration and image
generation using two constituent networks: a Stroke Calibration Network (SCN),
which calibrates strokes of facial features and enriches facial details while
preserving the original intent features; and an Image Synthesis Network (ISN),
which translates the calibrated and enriched sketches to photo-realistic face
images. In this way, we manage to decouple a difficult cross-domain translation
problem into two easier steps. Extensive experiments verify that the face
photos generated by Cali-Sketch are both photo-realistic and faithful to the
input sketches, compared with state-of-the-art methods.
Comment: 10 pages, 12 figure
Specialist or Generalist? Instruction Tuning for Specific NLP Tasks
The potential of large language models (LLMs) to simultaneously perform a
wide range of natural language processing (NLP) tasks has been the subject of
extensive research. Although instruction tuning has proven to be a
data-efficient method for transforming LLMs into such generalist models, their
performance still lags behind specialist models trained exclusively for
specific tasks. In this paper, we investigate whether incorporating
broad-coverage generalist instruction tuning can contribute to building a
specialist model. We hypothesize that its efficacy depends on task specificity
and skill requirements. Our experiments assess four target tasks with distinct
coverage levels, revealing that integrating generalist instruction tuning
consistently enhances model performance when the task coverage is broad. The
effect is particularly pronounced when the amount of task-specific training
data is limited. Further investigation into three target tasks focusing on
different capabilities demonstrates that generalist instruction tuning improves
understanding and reasoning abilities. However, for tasks requiring factual
knowledge, generalist data containing hallucinatory information may negatively
affect the model's performance. Overall, our work provides a systematic guide
for developing specialist models with general instruction tuning. Our code and
other related resources can be found at
https://github.com/DavidFanzz/Generalist_or_Specialist.
Comment: Accepted to EMNLP 202
TediGAN: Text-Guided Diverse Face Image Generation and Manipulation
In this work, we propose TediGAN, a novel framework for multi-modal image
generation and manipulation with textual descriptions. The proposed method
consists of three components: StyleGAN inversion module, visual-linguistic
similarity learning, and instance-level optimization. The inversion module maps
real images to the latent space of a well-trained StyleGAN. The
visual-linguistic similarity learns the text-image matching by mapping the
image and text into a common embedding space. The instance-level optimization
is for identity preservation in manipulation. Our model can produce diverse and
high-quality images with an unprecedented resolution at 1024. Using a control
mechanism based on style-mixing, our TediGAN inherently supports image
synthesis with multi-modal inputs, such as sketches or semantic labels, with or
without instance guidance. To facilitate text-guided multi-modal synthesis, we
propose the Multi-Modal CelebA-HQ, a large-scale dataset consisting of real
face images and corresponding semantic segmentation map, sketch, and textual
descriptions. Extensive experiments on the introduced dataset demonstrate the
superior performance of our proposed method. Code and data are available at
https://github.com/weihaox/TediGAN.
Comment: CVPR 2021. Code: https://github.com/weihaox/TediGAN Data:
https://github.com/weihaox/Multi-Modal-CelebA-HQ Video:
https://youtu.be/L8Na2f5viA
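The text-image matching described in the TediGAN abstract relies on placing images and text in a common embedding space. The sketch below illustrates only that general idea: once both modalities are represented as vectors of the same dimension, matching can be scored with cosine similarity. The function names and the toy embedding values are assumptions made for illustration and are not taken from the TediGAN code; the actual encoders and training objective are not shown.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_matching_text(image_embedding, text_embeddings):
    """Return the index of the text embedding most similar to the image embedding."""
    scores = [cosine_similarity(image_embedding, t) for t in text_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy embeddings, made up for illustration only (hypothetical values).
image = [0.9, 0.1, 0.0]
texts = [[0.0, 1.0, 0.0], [1.0, 0.2, 0.0], [0.0, 0.0, 1.0]]
best = best_matching_text(image, texts)  # index of the closest description
```

In a learned system the embeddings would come from trained image and text encoders; here they are fixed by hand so that the matching step can be read in isolation.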
Domain Fingerprints for No-reference Image Quality Assessment
Human fingerprints are detailed and nearly unique markers of human identity.
Such a unique and stable fingerprint is also left on each acquired image. It
can reveal how an image was degraded during the image acquisition procedure and
thus is closely related to the quality of an image. In this work, we propose a
new no-reference image quality assessment (NR-IQA) approach called domain-aware
IQA (DA-IQA), which for the first time introduces the concept of domain
fingerprint to the NR-IQA field. The domain fingerprint of an image is learned
from image collections of different degradations and then used as the unique
characteristics to identify the degradation sources and assess the quality of
the image. To this end, we design a new domain-aware architecture, which
enables simultaneous determination of both the distortion sources and the
quality of an image. With the distortion in an image better characterized, the
image quality can be more accurately assessed, as verified by extensive
experiments, which show that the proposed DA-IQA performs better than almost
all the compared state-of-the-art NR-IQA methods.
Comment: accepted by IEEE Transactions on Circuits and Systems for Video
Technology (TCSVT
Match4Rec: A Novel Recommendation Algorithm Based on Bidirectional Encoder Representation with the Matching Task
Characterizing users' interests accurately plays a significant role in an
effective recommender system. The sequential recommender system can learn
powerful hidden representations of users from successive user-item interactions
and dynamic users' preferences. To analyze such sequential data, conventional
methods mainly include Markov Chains (MCs) and Recurrent Neural Networks
(RNNs). Recently, the use of self-attention mechanisms and bi-directional
architectures has gained much attention. However, there still exists a major
limitation in previous works that they only model the user's main purposes in
the behavioral sequences separately and locally, and they lack the global
representation of the user's whole sequential behavior. To address this
limitation, we propose a novel bidirectional sequential recommendation
algorithm that integrates the user's local purposes with the global preference
by additive supervision of the matching task. We combine the mask task with the
matching task in the training process of the bidirectional encoder. A new
sample production method is also introduced to alleviate the effect of mask
noise. Our proposed model can not only learn bidirectional semantics from
users' behavioral sequences but also explicitly produce user representations
to capture the user's global preference. Extensive empirical studies demonstrate
our approach considerably outperforms various state-of-the-art models.
Comment: Accepted by ICONIP202
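The Match4Rec abstract states that the mask task and the matching task are combined during training. One common way to combine two objectives is a weighted sum of their losses, sketched below. The abstract does not give the loss definitions, so everything here is an assumption: cross-entropy for masked-item prediction, binary cross-entropy for the matching task, and a hypothetical `match_weight` parameter.

```python
import math

def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of a softmax over item logits (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]

def binary_cross_entropy(prob, label):
    """Binary cross-entropy for a predicted match probability."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def joint_loss(mask_logits, masked_item, match_prob, match_label, match_weight=1.0):
    """Weighted sum of the mask-task loss and the matching-task loss.

    match_weight is a hypothetical hyperparameter, not taken from the paper.
    """
    mask_loss = softmax_cross_entropy(mask_logits, masked_item)
    match_loss = binary_cross_entropy(match_prob, match_label)
    return mask_loss + match_weight * match_loss

# Uniform logits over two items and an uninformative match probability.
example = joint_loss([0.0, 0.0], 0, 0.5, 1)
```

With uniform logits over two items and a match probability of 0.5, each term equals ln 2, which makes the additive structure easy to check by hand.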
Multimodal Prototype-Enhanced Network for Few-Shot Action Recognition
Current methods for few-shot action recognition mainly fall into the metric
learning framework following ProtoNet. However, they either ignore the effect
of representative prototypes or fail to enhance the prototypes with multimodal
information adequately. In this work, we propose a novel Multimodal
Prototype-Enhanced Network (MORN) to use the semantic information of label
texts as multimodal information to enhance prototypes, including two modality
flows. A CLIP visual encoder is introduced in the visual flow, and visual
prototypes are computed by the Temporal-Relational CrossTransformer (TRX)
module. A frozen CLIP text encoder is introduced in the text flow, and a
semantic-enhanced module is used to enhance text features. After inflating,
text prototypes are obtained. The final multimodal prototypes are then computed
by a multimodal prototype-enhanced module. Besides, there exist no evaluation
metrics to evaluate the quality of prototypes. To the best of our knowledge, we
are the first to propose a prototype evaluation metric called Prototype
Similarity Difference (PRIDE), which is used to evaluate the performance of
prototypes in discriminating different categories. We conduct extensive
experiments on four popular datasets. MORN achieves state-of-the-art results on
HMDB51, UCF101, Kinetics and SSv2. MORN also performs well on PRIDE, and we
explore the correlation between PRIDE and accuracy.
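The MORN abstract places few-shot action recognition in the metric learning framework following ProtoNet. As background, the sketch below shows the plain ProtoNet step: each class prototype is the mean of its support embeddings, and a query is assigned to the nearest prototype. It does not implement MORN's TRX module, its multimodal prototype enhancement, or the PRIDE metric; the class labels and embedding values are made up for illustration.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def build_prototypes(support):
    """Map each class label to the mean of its support embeddings."""
    return {label: mean_vector(embs) for label, embs in support.items()}

def classify(query, prototypes):
    """Return the label whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))

# Toy 2-way 2-shot episode with hypothetical embeddings.
support = {
    "run": [[1.0, 0.0], [0.8, 0.2]],
    "jump": [[0.0, 1.0], [0.2, 0.8]],
}
prototypes = build_prototypes(support)
prediction = classify([0.7, 0.1], prototypes)
```

In practice the embeddings would come from a video encoder; fixing them by hand keeps the prototype computation and nearest-prototype rule visible.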